The Zürich Statistical Office collects data on the city and its residents, and publishes it as Linked Data.
In this tutorial, we will show how to work with Linked Data, using data on economic activities as an example.
We will look into how to query, process, and visualize it.
Data on economic activities is published as Linked Data and can be accessed with SPARQL queries.
Queries are sent as HTTP requests to the API endpoint at https://ld.stadt-zuerich.ch/query/.
Let's use the SparqlClient from graphly to communicate with the database.
Graphly will allow us to send SPARQL queries and receive the results as a pandas or geopandas data frame.
# Installing dependencies for Colab environment
!pip install mapclassify
!pip install git+https://github.com/zazuko/graphly.git
import mapclassify
import matplotlib
import matplotlib.cm
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from graphly.api_client import SparqlClient
ENDPOINT = "https://ld.stadt-zuerich.ch/query/"
sparql = SparqlClient(ENDPOINT)
sparql.add_prefixes({
"schema": "<http://schema.org/>",
"cube": "<https://cube.link/>",
"property": "<https://ld.stadt-zuerich.ch/statistics/property/>",
"measure": "<https://ld.stadt-zuerich.ch/statistics/measure/>",
"skos": "<http://www.w3.org/2004/02/skos/core#>",
"ssz": "<https://ld.stadt-zuerich.ch/statistics/>"
})
SPARQL queries can become very long. To improve readability, we will work with prefixes.
Using the add_prefixes method, we define persistent prefixes.
Every time you send a query, graphly will automatically add these prefixes for you.
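Conceptually, the client simply prepends the registered prefixes as PREFIX declarations to every query before sending it. A minimal sketch of this mechanism (the actual graphly implementation may differ):

```python
def expand_query(query: str, prefixes: dict) -> str:
    """Prepend PREFIX declarations for each registered prefix to a query."""
    header = "\n".join(
        f"PREFIX {name}: {iri}" for name, iri in prefixes.items()
    )
    return header + "\n" + query

prefixes = {"schema": "<http://schema.org/>", "cube": "<https://cube.link/>"}
query = "SELECT * WHERE { ?s schema:name ?name . }"
print(expand_query(query, prefixes).splitlines()[0])
# first line: PREFIX schema: <http://schema.org/>
```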
Let's find the number of restaurants in Zurich over time. This information is available in the AST-BTA data cube. To give the restaurant numbers some context, let's scale them by population size. The number of inhabitants over time can be found in the BEW data cube.
The query for number of inhabitants and restaurants over time is:
query = """
SELECT *
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
{
SELECT ?time (SUM(?ast) AS ?restaurants)
WHERE {
ssz:AST-BTA a cube:Cube;
cube:observationSet/cube:observation ?obs_rest.
?obs_rest property:TIME ?time ;
property:RAUM <https://ld.stadt-zuerich.ch/statistics/code/R30000> ;
property:BTA <https://ld.stadt-zuerich.ch/statistics/code/BTA5000> ;
measure:AST ?ast .
}
GROUP BY ?time
}
{
SELECT ?time ?pop
WHERE {
ssz:BEW a cube:Cube;
cube:observationSet/cube:observation ?obs_pop.
?obs_pop property:TIME ?time ;
property:RAUM <https://ld.stadt-zuerich.ch/statistics/code/R30000>;
measure:BEW ?pop
}
}
}
ORDER BY ?time
"""
df = sparql.send_query(query)
df.head()
|   | time | restaurants | pop |
|---|---|---|---|
| 0 | 1934-12-31 | 1328.0 | 315864.0 |
| 1 | 1935-12-31 | 1327.0 | 317157.0 |
| 2 | 1936-12-31 | 1321.0 | 317712.0 |
| 3 | 1937-12-31 | 1321.0 | 318926.0 |
| 4 | 1938-12-31 | 1334.0 | 326979.0 |
Let's calculate the number of restaurants per 10 000 inhabitants:
df = df.ffill()
df["Restaurants per 10 000 inhabitants"] = df["restaurants"]/df["pop"]*10000
fig = px.line(df, x="time", y = "Restaurants per 10 000 inhabitants")
fig.update_layout(title_text='Restaurants in Zürich over time', title_x=0.5)
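The forward fill and per-capita scaling can be sanity-checked on a toy frame with the same shape as the query result (illustrative values only, taken from the table above; the missing 1935 population falls back to the 1934 value):

```python
import pandas as pd

# Toy data mimicking the query result, with one missing population value
toy = pd.DataFrame({
    "time": ["1934-12-31", "1935-12-31", "1936-12-31"],
    "restaurants": [1328.0, 1327.0, 1321.0],
    "pop": [315864.0, None, 317712.0],
})
toy = toy.ffill()  # 1935 population is forward-filled from 1934
toy["per_10k"] = toy["restaurants"] / toy["pop"] * 10000
print(toy["per_10k"].round(1).tolist())  # [42.0, 42.0, 41.6]
```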
Let's find the number of restaurants in different parts of the city. The data on restaurants is available in the AST-BTA data cube. To place the Quartiers (city districts) on a map, we will need their geographic coordinates. This data is available in Wikidata. We will get the number of restaurants per Quartier from our endpoint, and each Quartier's centroid from Wikidata.
Both pieces of information can be obtained in a single SPARQL federated query. The endpoint for Wikidata is <https://query.wikidata.org/sparql>.
The query for Quartiers, their centroids, and restaurant counts is:
query = """
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
SELECT ?place ?geometry (SUM(?ast) AS ?restaurants)
WHERE {
ssz:AST-BTA a cube:Cube;
cube:observationSet/cube:observation ?obs.
?obs property:TIME ?time ;
property:RAUM ?place_uri ;
property:BTA/schema:name ?bta ;
measure:AST ?ast .
?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Quartier> ;
schema:name ?place ;
schema:sameAs ?wikidata_id .
FILTER (?time = "2017-12-31"^^xsd:date)
BIND(IRI(?wikidata_id ) AS ?wikidata_iri ) .
SERVICE <https://query.wikidata.org/sparql> {
?wikidata_iri p:P625/ps:P625 ?geometry .
}
FILTER (?bta = "Verpflegungsbetriebe")
}
GROUP BY ?place ?geometry ?time
"""
df = sparql.send_query(query)
df.head()
|   | place | geometry | restaurants |
|---|---|---|---|
| 0 | Hochschulen | POINT (8.54806 47.37640) | 62.0 |
| 1 | Oberstrass | POINT (8.54842 47.38434) | 31.0 |
| 2 | Rathaus | POINT (8.54503 47.37211) | 182.0 |
| 3 | Gewerbeschule | POINT (8.53174 47.38470) | 121.0 |
| 4 | Saatlen | POINT (8.56408 47.41019) | 10.0 |
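Graphly returns the Wikidata coordinates as geometry objects, which the map code below reads via `subset.geometry.x` and `subset.geometry.y`. If you ever work with the raw WKT literals instead, a minimal parser looks like this (an illustrative sketch; the `parse_wkt_point` helper is hypothetical, and a library such as shapely would normally handle this):

```python
import re

def parse_wkt_point(wkt: str) -> tuple:
    """Parse a WKT 'POINT (lon lat)' literal into (lon, lat) floats."""
    match = re.match(r"POINT\s*\(\s*([-\d.]+)\s+([-\d.]+)\s*\)", wkt)
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))

lon, lat = parse_wkt_point("POINT (8.54806 47.37640)")
print(lon, lat)  # 8.54806 47.3764
```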
Let's classify the number of restaurants into 5 buckets. We will use the mapclassify library to assign each value in the restaurants column to one of five categories.
N_CATEGORIES = 5
df["text"] = df.place + "<br>Restaurants: " + df.restaurants.astype(int).astype(str)
classifier = mapclassify.NaturalBreaks(y=df["restaurants"], k=N_CATEGORIES)
df["rest_buckets"] = df[["restaurants"]].apply(classifier)
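NaturalBreaks chooses bucket boundaries with the Jenks optimization, which minimizes variance within each bucket. If mapclassify is unavailable, equal-width buckets via pandas.cut give a rough stand-in (a sketch only, not equivalent to Jenks breaks):

```python
import pandas as pd

# Restaurant counts for a handful of districts (illustrative values)
counts = pd.Series([62, 31, 182, 121, 10, 45, 9, 73])
# Five equal-width bins, labelled 0..4 like mapclassify's bucket codes
buckets = pd.cut(counts, bins=5, labels=False)
print(buckets.tolist())
```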
Classified values can be easily visualized on the map.
norm = matplotlib.colors.Normalize(vmin=0, vmax=N_CATEGORIES)
colormap = matplotlib.cm.ScalarMappable(norm=norm, cmap=matplotlib.cm.viridis)
labels = mapclassify.classifiers._get_mpl_labels(classifier, fmt="{:.0f}")
fig = go.Figure()
for bucket in range(N_CATEGORIES):
subset = df[df.rest_buckets == bucket]
fig.add_trace(go.Scattermapbox(
mode="markers",
lat=subset.geometry.y,
lon=subset.geometry.x,
hovertext = subset.text,
hoverinfo = "text",
name=labels[bucket],
marker={'size': ((subset.restaurants)**1.5)*0.6, "sizemode": "area", "sizemin": 4, "color": "rgba{}".format(colormap.to_rgba(bucket+1))},
))
fig.update_layout(
margin={'l': 0, 't': 50, 'b': 0, 'r': 0},
mapbox={
'center': {"lat": 47.3815, "lon": 8.532},
'style': "carto-darkmatter",
'zoom': 11},
showlegend=True,
legend_title="Restaurants count",
title_text='Restaurants in Zürich Quartiers',
title_x=0.5
)
fig.show("notebook")
Let's take a look at gender representation in the public sector. The BES-BTA-SEX data cube contains the number of employees in different organizations, reported separately for each sex and establishment type. Let's find the number of male and female employees in after-school care (Hort).
The query for the number of female and male employees in after-school care over time is:
query = """
SELECT ?time ?employees ?sex
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:BES-BTA-SEX a cube:Cube;
cube:observationSet/cube:observation ?obs.
?obs property:TIME ?time ;
property:RAUM/skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Gemeinde> ;
property:BTA/schema:name "Horte" ;
property:SEX/schema:name ?sex ;
measure:BES ?employees .
}
ORDER BY ?time
"""
df = sparql.send_query(query)
df.head()
|   | time | employees | sex |
|---|---|---|---|
| 0 | 1966-06-30 | 1.0 | männlich |
| 1 | 1966-06-30 | 86.0 | weiblich |
| 2 | 1967-06-30 | 1.0 | männlich |
| 3 | 1967-06-30 | 86.0 | weiblich |
| 4 | 1968-06-30 | 87.0 | weiblich |
Let's rearrange and rename the columns:
df = pd.pivot_table(df, index="time", columns="sex", values="employees")
df = df.reset_index().rename_axis(None, axis=1)
df = df.rename(columns={"männlich": "male", "weiblich": "female"})
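The reshaping above turns the long format (one row per time and sex) into a wide one (one column per sex). On a toy frame with the values from the table above:

```python
import pandas as pd

long = pd.DataFrame({
    "time": ["1966-06-30", "1966-06-30", "1967-06-30", "1967-06-30"],
    "sex": ["männlich", "weiblich", "männlich", "weiblich"],
    "employees": [1.0, 86.0, 1.0, 86.0],
})
# One column per sex, one row per time
wide = pd.pivot_table(long, index="time", columns="sex", values="employees")
wide = wide.reset_index().rename_axis(None, axis=1)
wide = wide.rename(columns={"männlich": "male", "weiblich": "female"})
print(wide.columns.tolist())  # ['time', 'male', 'female']
```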
fig = px.histogram(df, x="time", y=["female", "male"], barnorm="percent")
fig.update_layout(
title='After-school care: gender representation',
title_x=0.5,
yaxis_title="% of employees"
)
fig.show("notebook")
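The `barnorm="percent"` option normalizes each bar to 100%, so the chart shows shares rather than absolute counts. The same shares can be computed directly in pandas as a cross-check (a sketch on toy numbers):

```python
import pandas as pd

# Toy wide-format frame: employees per sex, indexed by time
df = pd.DataFrame({"male": [1.0, 2.0], "female": [86.0, 86.0]},
                  index=["1966-06-30", "1967-06-30"])
# Divide each row by its total to get percentage shares
shares = df.div(df.sum(axis=1), axis=0) * 100
print(shares.round(1)["female"].tolist())  # [98.9, 97.7]
```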